Air Quality and Health Impact Analysis — Lab Assessment Report

  1. Dataset Description Source of Dataset: The dataset used in this project is sourced from the World Health Organization (WHO) Global Health Observatory under the Ambient Air Pollution (PM2.5) Dataset, 2022. It was downloaded as a CSV file and imported into R for analysis.

Description: The dataset contains information on fine particulate matter (PM2.5) concentrations across different countries, residence types (City, Rural, Urban, Total), and regions for multiple years. It is a reliable global dataset that helps study air quality and its potential health impact.

Structure:

Number of Records: 9,450 Number of Columns: 34

Main Attributes Used:

Region: Geographic grouping (e.g., Africa, Americas, Europe) Country: Country name Residence Type: Cities, Rural, Urban, Total Year (Period): Year of measurement FactValueNumeric (PM2.5): Annual mean concentration (µg/m³) Derived Fields: Health Impact Score, Health Category, Latitude, Longitude

Type of Dataset:

Format: CSV (Real-world dataset) Source: WHO Global Health Observatory Purpose: To analyze global air quality and assess its health impact through interactive visualizations.

  1. Code Implementation

Below is an overview of the code implemented for the interactive visualizations. The full code was written and executed in R Markdown, using Plotly, Leaflet, and Rbokeh libraries.

library(readr)
## Warning: package 'readr' was built under R version 4.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.3.3
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(leaflet)
library(RColorBrewer)
library(countrycode)
## Warning: package 'countrycode' was built under R version 4.3.3
library(rnaturalearth)
library(sf)
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.3.3
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(DT)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
data_path <- "data2.csv"
stopifnot(file.exists(data_path)) # stops if file not found
getwd()
## [1] "C:/Users/LENOVO/Documents/SEM07/pds/Air_Quality_and_Health_Impact_Analysis"
#Load and quick inspect

raw <- read_csv(data_path, guess_max = 5000)
glimpse(raw)
## Rows: 9,450
## Columns: 34
## $ IndicatorCode              <chr> "SDGPM25", "SDGPM25", "SDGPM25", "SDGPM25",…
## $ Indicator                  <chr> "Concentrations of fine particulate matter …
## $ ValueType                  <chr> "text", "text", "text", "text", "text", "te…
## $ ParentLocationCode         <chr> "AFR", "AMR", "EUR", "AMR", "AMR", "EUR", "…
## $ ParentLocation             <chr> "Africa", "Americas", "Europe", "Americas",…
## $ `Location type`            <chr> "Country", "Country", "Country", "Country",…
## $ SpatialDimValueCode        <chr> "KEN", "TTO", "GBR", "GRD", "BRA", "DNK", "…
## $ Location                   <chr> "Kenya", "Trinidad and Tobago", "United Kin…
## $ `Period type`              <chr> "Year", "Year", "Year", "Year", "Year", "Ye…
## $ Period                     <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ IsLatestYear               <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
## $ `Dim1 type`                <chr> "Residence Area Type", "Residence Area Type…
## $ Dim1                       <chr> "Cities", "Rural", "Cities", "Total", "Town…
## $ Dim1ValueCode              <chr> "RESIDENCEAREATYPE_CITY", "RESIDENCEAREATYP…
## $ `Dim2 type`                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Dim2                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Dim2ValueCode              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `Dim3 type`                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Dim3                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Dim3ValueCode              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ DataSourceDimValueCode     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ DataSource                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FactValueNumericPrefix     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FactValueNumeric           <dbl> 10.01, 10.02, 10.06, 10.08, 10.09, 10.12, 1…
## $ FactValueUoM               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FactValueNumericLowPrefix  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FactValueNumericLow        <dbl> 6.29, 7.44, 9.73, 7.07, 8.23, 9.37, 8.58, 9…
## $ FactValueNumericHighPrefix <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FactValueNumericHigh       <dbl> 13.74, 12.55, 10.39, 13.20, 12.46, 10.97, 1…
## $ Value                      <chr> "10.01 [6.29-13.74]", "10.02 [7.44-12.55]",…
## $ FactValueTranslationID     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FactComments               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Language                   <chr> "EN", "EN", "EN", "EN", "EN", "EN", "EN", "…
## $ DateModified               <dttm> 2022-08-11 18:30:00, 2022-08-11 18:30:00, …
DT::datatable(head(raw, 8), options = list(scrollX = TRUE))#show header sample
#Canonicalize columns

df <- raw %>%
rename(
Country = Location,
Region = ParentLocation,
ResidenceType = Dim1,
Year = Period,
PM25 = FactValueNumeric
) %>%
select(Country, Region, ResidenceType, Year, PM25, everything())#keep only rows with PM25 numeric
df <- df %>% filter(!is.na(PM25))
#quick checks

cat("Rows after filtering PM2.5:", nrow(df), "\n")
## Rows after filtering PM2.5: 9450
summary(df$PM25)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.59   11.92   19.57   23.54   30.98   97.49
DT::datatable(head(df %>% select(Country, Region, ResidenceType, Year, PM25), 20), options = list(scrollX = TRUE))
#Derive HealthImpactScore and HealthCategory

df <- df %>%
mutate(
# numeric Year
  
Year = as.integer(Year),
HealthImpactScore = case_when(
PM25 <= 5 ~ 1,
PM25 <= 15 ~ 2,
PM25 <= 25 ~ 3,
PM25 <= 35 ~ 4,
TRUE ~ 5
),
HealthCategory = case_when(
HealthImpactScore == 1 ~ "Low",
HealthImpactScore == 2 ~ "Moderate",
HealthImpactScore == 3 ~ "Sensitive groups",
HealthImpactScore == 4 ~ "Unhealthy",
HealthImpactScore == 5 ~ "Hazardous"
)
)
table(df$HealthCategory)
## 
##        Hazardous              Low         Moderate Sensitive groups 
##             1911               15             3429             2697 
##        Unhealthy 
##             1398
# Install required packages if missing

if (!require("rnaturalearth")) install.packages("rnaturalearth")
if (!require("rnaturalearthdata")) install.packages("rnaturalearthdata")
if (!require("sf")) install.packages("sf")
if (!require("countrycode")) install.packages("countrycode")

library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(countrycode)
library(dplyr)

# Add country ISO3 codes
df <- df %>%
  mutate(iso3 = countrycode(Country, "country.name", "iso3c"))

# Load world map safely
world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")

# Compute centroids for each country polygon
centroids <- st_centroid(world) %>%
  cbind(st_coordinates(.)) %>%
  select(iso_a3, X, Y) %>%
  rename(iso3 = iso_a3, long = X, lat = Y)

# Merge with main dataframe
df <- df %>%
  left_join(centroids, by = "iso3")

# Handle missing coordinates with random jitter (for countries not found)
missing_geo <- which(is.na(df$lat) | is.na(df$long))
if (length(missing_geo) > 0) {
  set.seed(42)
  df$lat[missing_geo] <- runif(length(missing_geo), -10, 50)
  df$long[missing_geo] <- runif(length(missing_geo), -80, 100)
}

# Check if coordinates now exist
head(df %>% select(Country, lat, long))
## # A tibble: 6 × 3
##   Country                                                  lat   long
##   <chr>                                                  <dbl>  <dbl>
## 1 Kenya                                                  0.599  37.8 
## 2 Trinidad and Tobago                                   10.5   -61.3 
## 3 United Kingdom of Great Britain and Northern Ireland  54.0    -2.76
## 4 Grenada                                               12.1   -61.7 
## 5 Brazil                                               -10.6   -53.2 
## 6 Denmark                                               56.0    10.0

Visualizations:

#1.Histogram — PM2.5 distribution

plot_ly(df, x=~PM25, type='histogram', nbinsx = 40) %>%
layout(title = "PM2.5 distribution (WHO data)", xaxis=list(title="PM2.5 (µg/m³)"), yaxis=list(title="Count"))

Insight: Shows how global PM2.5 values are distributed. Most countries fall under low exposure, with a long left tail representing low-pollution zones.

#2.Top 30 countries by mean PM2.5 (bar chart)

top_countries <- df %>% group_by(Country) %>% summarise(mean_PM25 = mean(PM25, na.rm=TRUE), n = n()) %>% arrange(desc(mean_PM25)) %>% slice(1:30)
plot_ly(top_countries, x=~reorder(Country, mean_PM25), y=~mean_PM25, type='bar', marker=list(color=~mean_PM25, colorscale='YlOrRd')) %>%
layout(title="Top 30 countries by mean PM2.5", xaxis = list(title="", tickangle=-45), yaxis=list(title="Mean PM2.5"))

Insight: Highlights the most polluted countries globally, using YlOrRd color scale for intuitive severity display.

#3.PM2.5 by Residence Type (boxplot)

plot_ly(df, x=~ResidenceType, y=~PM25, type='box', color=~ResidenceType) %>%
layout(title="PM2.5 by Residence Type (Cities / Urban / Rural / Total)")

Insight: City and urban areas show higher median pollution levels.

#4.PM2.5 time trend (Yearly mean) — if multiple years present

if(sum(!is.na(df$Year))>0){
yearly <- df %>% group_by(Year) %>% summarise(mean_PM25 = mean(PM25, na.rm=TRUE))
plot_ly(yearly, x=~Year, y=~mean_PM25, type='scatter', mode='lines+markers') %>%
layout(title="Global yearly mean PM2.5 (WHO data)", xaxis=list(title="Year"), yaxis=list(title="Mean PM2.5"))
} else { cat("Year information not available for trend plot.") }

Insight: Displays global PM2.5 trends over time. Used to assess if pollution levels have decreased or stabilized in recent years.

#5.Map — mean PM2.5 by country (Leaflet)

map_data <- df %>% group_by(Country, lat, long) %>% summarise(mean_PM25 = mean(PM25, na.rm=TRUE), n = n()) %>% arrange(desc(mean_PM25))
pal <- colorNumeric("YlOrRd", domain = map_data$mean_PM25)
leaflet(map_data) %>% addTiles() %>%
addCircleMarkers(~long, ~lat, radius = ~pmin(15, mean_PM25/5), color = ~pal(mean_PM25),
stroke = FALSE, fillOpacity = 0.8,
popup = ~paste0("", Country, "Mean PM2.5: ", round(mean_PM25,1))) %>%
addLegend("bottomright", pal = pal, values = ~mean_PM25, title = "Mean PM2.5")

Insight: Provides a spatial overview of pollution hotspots; larger red markers indicate higher PM2.5.

#6.Leaflet multi-layer view (ResidenceType filter via groups)

leaflet() %>% addTiles() %>%
{ m <- .
for(rt in unique(na.omit(df$ResidenceType))){
sub <- df %>% filter(ResidenceType == rt) %>% group_by(Country, lat, long) %>% summarise(mean_PM25 = mean(PM25, na.rm=TRUE))
m <- m %>% addCircleMarkers(data=sub, lng=~long, lat=~lat, radius=~pmin(12, mean_PM25/6), color=~colorNumeric("YlOrRd", sub$mean_PM25)(sub$mean_PM25), group = rt,
popup = ~paste0("", Country, "Residence: ", rt, "Mean PM2.5: ", round(mean_PM25,1)))
}
m %>% addLayersControl(overlayGroups = unique(na.omit(df$ResidenceType)), options = layersControlOptions(collapsed=FALSE))
}

Insight: Enables filtering of pollution levels by residence category — interactive comparison between population zones.

#7.PM2.5 vs HealthImpactScore (scatter plot)

plot_ly(df, x=~PM25, y=~HealthImpactScore, color=~HealthCategory, text=~paste("Country:", Country, "Res:", ResidenceType), type='scatter', mode='markers') %>%
layout(title="PM2.5 vs derived HealthImpactScore", xaxis=list(title="PM2.5"), yaxis=list(title="HealthImpactScore (1-5)"))

Insight: Clear positive correlation between PM2.5 and health risk.

#8.Violin / distribution of HealthImpactScore by Region

plot_ly(df, x=~Region, y=~HealthImpactScore, type='violin', color=~Region) %>%
layout(title="Distribution of HealthImpactScore by Region", xaxis=list(tickangle=-30))

Insight: Demonstrates variability in health impact distribution across regions. Regions like Africa, south-east Asia show broader, higher-risk distributions.

#9.Heatmap: Region × ResidenceType mean PM2.5

heat_df <- df %>% group_by(Region, ResidenceType) %>% summarise(mean_PM25 = mean(PM25, na.rm=TRUE)) %>% ungroup()

#pivot to matrix for plotly heatmap

heat_wide <- heat_df %>% pivot_wider(names_from = ResidenceType, values_from = mean_PM25, values_fill = 0)
zmat <- as.matrix(heat_wide %>% select(-Region))
plot_ly(x = colnames(zmat), y = heat_wide$Region, z = zmat, type = "heatmap", colorscale = "Viridis") %>%
layout(title="Mean PM2.5 by Region (rows) and Residence Type (columns)", xaxis=list(title="Residence Type"), yaxis=list(title="Region"))

Insight: Reveals that cities and urban regions within certain continents contribute the most to pollution levels.

#10.Boxplot: PM2.5 by Region

plot_ly(df, x=~Region, y=~PM25, type='box', color=~Region) %>%
layout(title="PM2.5 distribution by Region", xaxis=list(tickangle=-30))

Insight: Confirms inter-regional disparity — Eastern Mediterranean and South-East Asia dominate the upper quartiles.

#11.Small multiples: 6-country yearly trends (if Year exists)

top6 <- top_countries %>% slice(1:6) %>% pull(Country)
cmp <- df %>% filter(Country %in% top6) %>% group_by(Country, Year) %>% summarise(mean_PM25 = mean(PM25, na.rm=TRUE))
plot_ly(cmp, x=~Year, y=~mean_PM25, color=~Country, split=~Country, type='scatter', mode='lines+markers') %>%
layout(title='Yearly PM2.5 for top 6 countries')

Insight: Shows if heavily polluted countries are improving or worsening over time.

#12.Correlation-like summary: parameter influence (here PM2.5 only)

region_rank <- df %>% group_by(Region) %>% summarise(mean_health_score = mean(HealthImpactScore, na.rm=TRUE), mean_PM25 = mean(PM25, na.rm=TRUE)) %>% arrange(desc(mean_health_score))
plot_ly(region_rank, x=~Region, y=~mean_health_score, type='bar', color=~mean_PM25, colorscale='YlOrRd') %>%
layout(title="Region mean HealthImpactScore (colored by mean PM2.5)")
region_rank
## # A tibble: 6 × 3
##   Region                mean_health_score mean_PM25
##   <chr>                             <dbl>     <dbl>
## 1 Eastern Mediterranean              4.13      38.4
## 2 South-East Asia                    3.85      32.1
## 3 Africa                             3.78      29.8
## 4 Europe                             2.84      19.2
## 5 Western Pacific                    2.58      15.5
## 6 Americas                           2.49      14.6

Insight: Regions with high PM2.5 also show high average health impact scores — confirming PM2.5 as the dominant influencing parameter.

#13.Interactive data table for queries

DT::datatable(df %>% select(Country, Region, ResidenceType, Year, PM25, HealthCategory, HealthImpactScore) %>% arrange(desc(PM25)),
options = list(pageLength = 10, scrollX = TRUE))

Insight: Allows users to query countries, filter regions, and directly view pollution-health relationships interactively.

#14.Bubble chart: HealthImpactScore vs mean_PM25 by country (size = count)

bubble <- df %>% group_by(Country) %>% summarise(mean_PM25 = mean(PM25, na.rm=TRUE), mean_health = mean(HealthImpactScore, na.rm=TRUE), n=n())
plot_ly(bubble, x=~mean_PM25, y=~mean_health, size=~n, text=~Country, type='scatter', mode='markers', marker=list(sizemode='area')) %>%
layout(title="Country-level: mean PM2.5 vs mean HealthImpactScore (size = observations)")

Insight: Visualizes both PM2.5 intensity and data frequency. Larger bubbles indicate more complete data; color and position reflect health severity

#Save cleaned dataset

write_csv(df %>% 
            select(Country, Region, ResidenceType, Year, PM25, HealthCategory, HealthImpactScore, lat, long), 
          "who_pm25_cleaned.csv")
cat("Saved cleaned dataset to who_pm25_cleaned.csv\n")
## Saved cleaned dataset to who_pm25_cleaned.csv
# Read the CSV back into R
who_pm25_cleaned <- read_csv("who_pm25_cleaned.csv")
head(who_pm25_cleaned, 10)  # shows first 10 rows
## # A tibble: 10 × 9
##    Country     Region ResidenceType  Year  PM25 HealthCategory HealthImpactScore
##    <chr>       <chr>  <chr>         <dbl> <dbl> <chr>                      <dbl>
##  1 Kenya       Africa Cities         2019  10.0 Moderate                       2
##  2 Trinidad a… Ameri… Rural          2019  10.0 Moderate                       2
##  3 United Kin… Europe Cities         2019  10.1 Moderate                       2
##  4 Grenada     Ameri… Total          2019  10.1 Moderate                       2
##  5 Brazil      Ameri… Towns          2019  10.1 Moderate                       2
##  6 Denmark     Europe Urban          2019  10.1 Moderate                       2
##  7 Russian Fe… Europe Cities         2019  10.2 Moderate                       2
##  8 Spain       Europe Cities         2019  10.2 Moderate                       2
##  9 Grenada     Ameri… Towns          2019  10.2 Moderate                       2
## 10 Grenada     Ameri… Urban          2019  10.2 Moderate                       2
## # ℹ 2 more variables: lat <dbl>, long <dbl>

Click here to download the cleaned CSV

  1. Discussion of Results

Key Observations PM2.5 concentration is highest in urban and city zones. Africa, Eastern Mediterranean and South-East Asia display the most alarming PM2.5 averages. HealthImpactScore is directly correlated with PM2.5, making it the most critical influencing parameter. Interactive maps and plots revealed nuanced differences that static charts could not — especially the spatial relationships between air quality and health risk.

Parameter Influence PM2.5 concentration was the most influential factor determining health outcomes. The “Regional Correlation Summary” and “PM2.5 vs HealthImpactScore” plots both confirmed a near-linear rise in risk with increasing pollution levels.

Overall Insight: Interactive visualization revealed deeper patterns than static charts could — connecting pollution exposure, regional disparities, and health risk in one visual framework.

  1. References

World Health Organization. (2022). Ambient Air Pollution Database (PM2.5). Retrieved from https://www.who.int/data/gho/data/themes/air-pollution Plotly Technologies Inc. (2024). Plotly R Open Source Graphing Library. Retrieved from https://plotly.com/r/ Leaflet for R. (2024). Interactive Maps with R. RStudio. Retrieved from https://rstudio.github.io/leaflet/ Bokeh Developers. (2023). Rbokeh: An R Interface to Bokeh Visualization Library. Retrieved from https://hafen.github.io/rbokeh/ R Core Team. (2023). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.